---
title: "Data Wrangling notebook"
output: 
  html_notebook:
    toc: yes
    theme: readable
---

> ###### *This notebook is Part 2 of the "Introductory Big Data Analyses to gain insights in Precision Medicine" course.*

## R setup \| Install software

-   Download and install R: <https://cloud.r-project.org/bin/windows/base/>

-   Download and install RStudio: <https://www.rstudio.com/products/rstudio/download/>

## R setup \| Packages

-   Install package

```{r}
install.packages('dplyr', dependencies=TRUE)
```

-   Load package

```{r}
library(dplyr)
```

## Download data and open file in R

1.  Download the data from the [exoRBase](http://www.exorbase.org/exoRBaseV2/download/toIndex) website.

2.  Open file in R

```{r}
hcc_rna<-read.delim(file.choose(),header=TRUE)
healthy_rna<-read.delim(file.choose(),header=TRUE)
```

#### Other functions to open file

![](Images/Altcodesreadfile.jpg){width="459"}
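
As a minimal sketch of alternatives to `file.choose()`, the chunk below writes a small, made-up tab-delimited file to a temporary location and reads it back with an explicit path. The file contents and the `toy_` names are assumptions for illustration only; they are not the exoRBase data and may not match the functions listed in the image above.

```{r}
# Write a tiny, hypothetical tab-delimited file so the example is self-contained
toy_path <- tempfile(fileext = ".txt")
writeLines(
  c("Gene.symbol\tHCC001\tHCC002",
    "A1BG\t0\t1.8",
    "FGR\t3.2\t7.1"),
  toy_path
)

# read.delim() assumes tab-separated values with a header row
toy_delim <- read.delim(toy_path, header = TRUE)

# read.table() reads the same file once the separator is stated explicitly
toy_table <- read.table(toy_path, header = TRUE, sep = "\t")
```

Giving the path in the code, rather than choosing the file by hand, makes the notebook easier to re-run later.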

## R packages for Data wrangling

-   A typical data science project :

![](Images/overview%20of%20datasci.jpg){width="536"}

*Reference: Grolemund, G., & Wickham, H. (2017). R for Data Science. O'Reilly Media.*

-   To manipulate the data for visualization and insights, we can use the "dplyr" or "tidyverse" package in R.

    1.  dplyr

        ![](Images/dplyr.jpg){width="340"}

        *Reference: <https://dplyr.tidyverse.org/>*

    2.  Tidyverse - collection of R packages for data science

        (dplyr is one of the core packages in tidyverse)

        ![](Images/tidyverse%20package.jpg){width="380"}

        *Reference: <https://www.tidyverse.org/>*

------------------------------------------------------------------------

## Step 1 Data wrangling using the dplyr package

#### 1. Load dplyr package

```{r}
library(dplyr)
```

#### 2. Filter rows

```{r}
# filter ( data frame, column name == value) 
# can use <, <=, >, >=
FGR_expression<-filter(hcc_rna, Gene.symbol=="FGR")
hcc004_expression<-filter(hcc_rna, HCC004<=1)

#if you have multiple criteria
# can use | (or), & (and)
hcc004_hcc002_expression<-filter(hcc_rna, HCC004<=1 & HCC002>1)
```

![](Images/filter%20codes.jpg){width="416"}

*Reference:* [*https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf*](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
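
The comments above mention `|` (or), but only `&` is demonstrated. The sketch below shows `|` and `%in%` on a small made-up table; `toy_rna` and its values are hypothetical and only mimic the layout of `hcc_rna`.

```{r}
library(dplyr)

# Hypothetical toy table: one Gene.symbol column plus one column per sample
toy_rna <- data.frame(
  Gene.symbol = c("A1BG", "ABCB1", "FGR", "TP53"),
  HCC001 = c(0, 12.5, 3.2, 0.4),
  HCC002 = c(1.8, 0, 7.1, 0.9)
)

# | (or): keep genes with low expression in at least one of the two samples
toy_low_any <- filter(toy_rna, HCC001 <= 1 | HCC002 <= 1)

# %in%: keep genes whose symbol matches any entry of a vector
toy_picked <- filter(toy_rna, Gene.symbol %in% c("FGR", "TP53"))
```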

#### 3. Arrange rows

```{r}
# arrange (data frame, column name)
hcc_sort<-arrange(hcc_rna, Gene.symbol)

#sort by more than one variable
hcc_sort<-arrange(hcc_rna, HCC001,Gene.symbol)
```
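
`arrange()` sorts in ascending order by default. A small sketch of descending order with `desc()`, using a hypothetical `toy_rna` table:

```{r}
library(dplyr)

# Hypothetical toy table with a tie in HCC001
toy_rna <- data.frame(
  Gene.symbol = c("TP53", "A1BG", "FGR"),
  HCC001 = c(3.2, 0, 3.2)
)

# Highest expression first; ties are then ordered alphabetically by gene symbol
toy_sorted <- arrange(toy_rna, desc(HCC001), Gene.symbol)
```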

#### 4. Select columns

```{r}
# select (data frame, column name 1, column name 2)
hcc_select<-select(hcc_rna, Gene.symbol, HCC004)
```

#### 5. Select all columns except specific column

```{r}
# select (data frame, -columns that you don't want)
hcc_all<-select(hcc_rna, -c(Gene.symbol, HCC004))
```
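
When there are many sample columns, typing each name is impractical. The sketch below uses dplyr's selection helpers on a hypothetical table; the `N001` column name is an assumption standing in for a healthy-donor sample.

```{r}
library(dplyr)

# Hypothetical toy table with two HCC samples and one other sample
toy_rna <- data.frame(
  Gene.symbol = c("A1BG", "FGR"),
  HCC001 = c(0, 3.2),
  HCC002 = c(1.8, 7.1),
  N001 = c(0.5, 2.4)
)

# Keep the gene symbol plus every column whose name starts with "HCC"
toy_hcc_only <- select(toy_rna, Gene.symbol, starts_with("HCC"))

# Keep only the numeric columns (drops Gene.symbol)
toy_numeric <- select(toy_rna, where(is.numeric))
```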

#### 6. Recode data

###### e.g. Count how many patients have RNA expression \>0 and store this value in a new column called "counthighexp"

```{r}
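# mutate (data frame, new column name = functions)
# count over the numeric sample columns only, so the text column Gene.symbol is not compared with 0
hcc_highexp<-mutate(hcc_rna,counthighexp=rowSums(select(hcc_rna,where(is.numeric))>0))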
```

###### eg. Recode numeric values to categories

```{r}
# mutate (data frame, new column name = functions)
hcc_highexp <-mutate(hcc_highexp, cat = case_when(
                         counthighexp < 50 ~ "low",
                         counthighexp >= 50 ~ "high"))
```
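
The two recoding steps can be checked by hand on a small table. This is a sketch on made-up values; `toy_rna`, `n_expressed` and the three category labels are hypothetical names chosen for the example.

```{r}
library(dplyr)

# Hypothetical toy table: three genes measured in three samples
toy_rna <- data.frame(
  Gene.symbol = c("A1BG", "ABCB1", "FGR"),
  HCC001 = c(0, 12.5, 3.2),
  HCC002 = c(0, 0, 7.1),
  HCC003 = c(0, 4.0, 1.1)
)

# Count, per gene, the samples with expression above 0 (numeric columns only)
toy_counts <- mutate(
  toy_rna,
  n_expressed = rowSums(select(toy_rna, where(is.numeric)) > 0)
)

# case_when() checks the conditions in order; TRUE acts as the catch-all
toy_counts <- mutate(toy_counts, cat = case_when(
  n_expressed == 0 ~ "none",
  n_expressed < 3 ~ "some",
  TRUE ~ "all"
))
```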

#### 7. Summarize data

```{r}
summarize(hcc_highexp,mean(counthighexp))
```
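
`summarize()` can return several named statistics at once. A minimal sketch on a hypothetical table (`toy_counts` and its column names are assumptions):

```{r}
library(dplyr)

# Hypothetical per-gene counts of samples with expression above 0
toy_counts <- data.frame(
  Gene.symbol = c("A1BG", "ABCB1", "FGR", "TP53"),
  n_expressed = c(0, 2, 3, 3)
)

# Naming each summary gives readable column names in the result
toy_summary <- summarize(
  toy_counts,
  n_genes = n(),                        # number of rows (genes)
  mean_expressed = mean(n_expressed),   # average count per gene
  n_all_samples = sum(n_expressed == 3) # genes expressed in all 3 samples
)
```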

#### 8. Handle relational data (Multiple tables of data)

![](Images/understand%20join.jpg){width="148"}

![](Images/Innerjoin.jpg){width="245"}

![](Images/Outerjoin.jpg){width="295"}

![](Images/Filtering%20join.jpg){width="348"}

*Reference: Grolemund, G., & Wickham, H. (2017). R for Data Science. O'Reilly Media.*

```{r}
# inner_join (data frame 1, data frame 2, by="key")
hcc_healthy<-inner_join(hcc_rna,healthy_rna, by="Gene.symbol")
```
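
To see how the join types differ, the sketch below applies four of them to two small made-up tables that share only some genes. The table names, the `N001` sample name and all values are hypothetical.

```{r}
library(dplyr)

# Two hypothetical tables with partly overlapping genes
toy_hcc <- data.frame(
  Gene.symbol = c("A1BG", "ABCB1", "FGR"),
  HCC001 = c(0, 12.5, 3.2)
)
toy_healthy <- data.frame(
  Gene.symbol = c("ABCB1", "FGR", "TP53"),
  N001 = c(1.1, 2.4, 0.7)
)

# inner_join: only genes present in both tables
toy_inner <- inner_join(toy_hcc, toy_healthy, by = "Gene.symbol")

# full_join: all genes from either table, with NA where a value is missing
toy_full <- full_join(toy_hcc, toy_healthy, by = "Gene.symbol")

# left_join: all genes from the first table
toy_left <- left_join(toy_hcc, toy_healthy, by = "Gene.symbol")

# anti_join (a filtering join): genes in the first table with no match in the second
toy_anti <- anti_join(toy_hcc, toy_healthy, by = "Gene.symbol")
```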

------------------------------------------------------------------------

### Exercise 1

1.  Generate a mastersheet by merging **hcc_rna** dataset and **healthy_rna** dataset. Name the new mastersheet as **hcc_healthy**.

    ###### *Question: What type of join will you use for this dataset?*

2.  Summarize the number of genes in **hcc_healthy**.

3.  In **hcc_healthy**, use the Gene.symbol column as the row names of the table.

```{r}
#without any package: data.frame(dataset name, row.names = column number)
#using tidyverse package: column_to_rownames(dataset name , var="key") 
```

4.  Then, remove genes that have 0 expression in more than 90% of the samples in the dataset. Name the dataset **hcc_healthy_highexp**.
5.  Summarize the number of genes that remain after this filter.
6.  Normalize all gene expression values using the log2 transformation. Name the dataset **hcc_healthy_log**.

```{r}
#log((data frame+1),2)
```

#### Exercise 1 answers

```{r}
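#merge hcc_rna with healthy_rna, keeping genes found in either dataset
hcc_healthy<-full_join(hcc_rna,healthy_rna,by="Gene.symbol")
#35517 genes

#convert gene symbol column (the first column) to rownames
hcc_healthy <- data.frame(hcc_healthy, row.names = 1)

#remove genes that have 0 expression in more than 90% of samples across both datasets
#count0 = number of samples with no expression; 207 is the cut-off used for 90% of the samples
#a gene missing from one dataset has NA values, so its count0 is NA and filter() drops it
hcc_healthy<-mutate(hcc_healthy,count0=rowSums(hcc_healthy<=0))
hcc_healthy_highexp<-filter(hcc_healthy,count0<=207)
hcc_healthy_highexp<-select(hcc_healthy_highexp,-count0)
#18,575 genes left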

#log transformation of the data
hcc_healthy_log<-log((hcc_healthy_highexp+1),2)
```
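
The cut-off of 207 is specific to the number of samples in this dataset. As a sketch of the same filter with the cut-off derived from the data instead, the chunk below uses a made-up table with 10 samples; `toy_expr` and the gene names are hypothetical.

```{r}
# Hypothetical toy data: genes as row names, 10 samples as columns
toy_expr <- data.frame(
  rbind(
    GENE_A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # zero in all 10 samples
    GENE_B = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 4),  # zero in 9 of 10 samples
    GENE_C = c(3, 0, 8, 1, 0, 2, 6, 0, 9, 4)   # zero in 3 of 10 samples
  )
)

# Number of samples with no expression, per gene
zero_count <- rowSums(toy_expr <= 0)

# Largest number of zero samples allowed: 90% of the sample count
max_zero <- 0.9 * ncol(toy_expr)

# Keep genes at or below the cut-off, then apply log2(x + 1)
toy_highexp <- toy_expr[zero_count <= max_zero, ]
toy_log <- log2(toy_highexp + 1)
```

Here `GENE_A` is removed, while `GENE_B` sits exactly on the cut-off and is kept, matching the `<=` used in the answer above.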

## About Data normalization

-   Log transformation - to moderate the variance across the mean

![](Images/Data%20normalization.jpg){width="616"}

-   Reason to use Log2 instead of Log10

![](Images/log2reason.jpg){width="614"}

-   Reason for expression value + 1 before log transformation

![](Images/Data%20normalization2.jpg){width="618"}
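
A short worked example of the two points above, on made-up expression values. Adding 1 keeps zero counts finite, and on the log2 scale each doubling of `value + 1` adds exactly 1.

```{r}
# Hypothetical raw expression values, including a zero
toy_values <- c(0, 1, 3, 7, 1023)

# Without the +1, a zero becomes -Inf
toy_log_no_offset <- log2(toy_values)

# With the +1, a zero maps to 0 and the rest stay finite
toy_log2 <- log2(toy_values + 1)   # 0 1 2 3 10

# log(x, 2) is the form used in the exercise and gives the same result
toy_log_base2 <- log(toy_values + 1, 2)
```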

------------------------------------------------------------------------

## Step 3a Data visualization (Basic histogram)

### R packages for data visualizations

```{r}
#Install packages
install.packages("ggplot2",dependencies=TRUE)
install.packages("ggpubr",dependencies=TRUE)

#Load packages
library(ggplot2)
library(ggpubr)
```

![](Images/GGplot2.jpg)

*Reference:* [*https://rstudio.github.io/cheatsheets/data-visualization.pdf*](https://rstudio.github.io/cheatsheets/data-visualization.pdf)

Let's visualize the dataset before and after normalization.

1.  Transpose the dataset

```{r}
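#Transpose the datasets so that the RNA names are in columns while samples are in rows
#use hcc_healthy_highexp for the raw values: same genes as hcc_healthy_log and no count0 helper column
raw_trans<-data.frame(t(hcc_healthy_highexp))
normalize_trans<-data.frame(t(hcc_healthy_log))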
```

2.  Draw histogram using ggplot

```{r}
#select one of the RNA (ABCB1 as example)
#use ggplot to draw histogram and modify the binwidth and colour
rawgraph<-ggplot(raw_trans,aes(x=ABCB1))+geom_histogram(binwidth=0.5,fill="deepskyblue")
normalizegraph<-ggplot(normalize_trans,aes(x=ABCB1))+geom_histogram(binwidth=0.5,fill="deepskyblue")
```

3.  Place the histograms side by side for better comparison

```{r}
ggarrange(rawgraph,normalizegraph,ncol=2,nrow=1,labels=c("raw","normalize"))
```

### Useful resources to explore further

![](Images/GGplot2_otherresources.jpg)

***Save the R code and environment for the next section!***

------------------------------------------------------------------------

## Additional information that may be useful in data wrangling

-   Introduction to Tidyverse package

Data frame -\> tibble in Tidyverse package

![](Images/Tibble.jpg){width="418"}

[`tibble()`](https://tibble.tidyverse.org/reference/tibble.html) does much less than [`data.frame()`](https://rdrr.io/r/base/data.frame.html): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, it only recycles inputs of length 1, and it never creates `row.names()`

*Reference: <https://tibble.tidyverse.org/>*

1.  Load tidyverse package

```{r}
library(tidyverse)
```

2.  Convert data frame to tibble

```{r}
hcc_tbl<-as_tibble(hcc_rna)
healthy_tbl<-as_tibble(healthy_rna)
# annotation_tbl<-as_tibble(annotation)  # run only if a data frame named annotation has been loaded
```

Alternatively, use **read_csv** or **read_delim** (from the readr package) instead of **read.csv** or **read.delim** to import the data directly as a tibble.

3.  The power of pipes in tidyverse

```{r}
#For example, if you want to select column then filter, normally this is what you do:
hcc_select<-select(hcc_rna, Gene.symbol, HCC004)
hcc_filter<-filter(hcc_select, HCC004>=1000)

#with pipes
hcc_tocompare<-hcc_rna%>%select(Gene.symbol, HCC004)%>%filter(HCC004>=1000)
```
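
Pipes scale to longer chains without intermediate objects. A sketch on a hypothetical `toy_rna` table that selects, filters and sorts in one statement:

```{r}
library(dplyr)

# Hypothetical toy table
toy_rna <- data.frame(
  Gene.symbol = c("A1BG", "ABCB1", "FGR", "TP53"),
  HCC001 = c(0, 12.5, 3.2, 0.4),
  HCC002 = c(1.8, 0, 7.1, 0.9)
)

# Read the chain top to bottom: take the table, then select, then filter, then sort
toy_top <- toy_rna %>%
  select(Gene.symbol, HCC001) %>%
  filter(HCC001 > 1) %>%
  arrange(desc(HCC001))
```

From R 4.1 onwards, base R also provides a native pipe, `|>`, which can be used the same way in simple chains like this one.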

-   To specify column names

```{r}
#colnames(dataset name)<-dataset[first row,]
```

-   To change data type of a column (eg. character to numeric)

```{r}
#dataset$column name<-as.numeric(dataset$column name)
```
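
The two templates above often go together: when a file is read without a header, the real column names sit in the first row and every column is stored as text. A minimal sketch on made-up data (`toy_raw` and `toy_clean` are hypothetical names):

```{r}
# Hypothetical table read without a header: names are stuck in row 1
toy_raw <- data.frame(
  V1 = c("Gene.symbol", "A1BG", "FGR"),
  V2 = c("HCC001", "0", "3.2"),
  stringsAsFactors = FALSE
)

# Use the first row as column names (unlist() turns the one-row data frame into a vector)
colnames(toy_raw) <- as.character(unlist(toy_raw[1, ]))

# Drop the row that held the names and reset the row numbering
toy_clean <- toy_raw[-1, ]
rownames(toy_clean) <- NULL

# The expression column is still character; convert it to numeric
toy_clean$HCC001 <- as.numeric(toy_clean$HCC001)
```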

-   To remove duplicates

```{r}
#using pipes:
#dataset name %>% distinct (specific column name, .keep_all=TRUE)

#without using pipes:
#distinct(dataset name, specific column, .keep_all=TRUE)
```
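
A small sketch of `distinct()` on a made-up table with a repeated gene symbol. With `.keep_all=TRUE`, the first row found for each gene is kept along with its other columns.

```{r}
library(dplyr)

# Hypothetical table in which FGR appears twice
toy_dup <- data.frame(
  Gene.symbol = c("FGR", "FGR", "TP53"),
  HCC001 = c(3.2, 3.5, 0.4)
)

# One row per gene symbol; the first occurrence wins
toy_unique <- distinct(toy_dup, Gene.symbol, .keep_all = TRUE)
```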

------------------------------------------------------------------------

The End of Part 2
